Description

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave the credit card service, and understand the reasons why, so that the bank can improve in those areas.

Objective

  1. Explore and visualize the dataset.
  2. Build a classification model to predict whether a customer is going to churn.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary

Problem Statement

Thera Bank needs to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Import necessary Libraries

Overview of the dataset

Displaying the first 5 and last 5 rows of the data set

Viewing the shape of the dataset

Check the Datatypes and columns of the dataset

Summary of the dataset

Fixing the Datatypes
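The overview steps above (first/last rows, shape, dtypes, datatype fixes) can be sketched as follows; the tiny frame and its values are illustrative stand-ins for the real CSV:

```python
import pandas as pd

# Illustrative frame standing in for the Thera Bank data (real data is read from CSV)
df = pd.DataFrame({
    "Customer_Age": [45, 49, 51],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"],
    "Income_Category": ["$60K - $80K", "Less than $40K", "$80K - $120K"],
})

print(df.head())    # first rows (df.tail() for the last rows)
print(df.shape)     # (rows, columns)
print(df.dtypes)    # object columns are candidates for the 'category' dtype

# Fix datatypes: convert string columns to memory-efficient categoricals
for col in ["Attrition_Flag", "Income_Category"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```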

Exploratory Data Analysis (EDA)

Univariate Analysis

Customer_Age

Observations on Months_on_book

Credit_Limit

Total_Revolving_Bal

Avg_Open_To_Buy

Total_Amt_Chng_Q4_Q1

Observations on Total_Trans_Amt

Total_Trans_Ct

Observations on Total_Ct_Chng_Q4_Q1

Avg_Utilization_Ratio

Attrition_Flag

Gender

Dependent_count

Education_Level

30.9% of customers are graduates, followed by 19.9% who have a high school education. About 15% show as Unknown; we will consider these as missing values and treat them later.

Marital_Status

This graph shows that 46.3% of customers are married, 38.9% are single, and 7.4% are divorced. Another 7.4% of customers are marked Unknown, which we can treat as missing values.

Income_Category

Card_Category

Total_Relationship_Count

Months_Inactive_12_mon

Contacts_Count_12_mon

Bivariate Analysis

Bivariate Analysis of Attrition_Flag with categorical variables

Gender and Attrition

Dependent count and Attrition

This graph shows higher attrition for customers who have 3 or 4 dependents. Attrition for customers with 0, 1, 2, or 5 dependents remains almost the same.

Educational Level and Attrition

Marital Status and Attrition

In this graph, attrition does not vary much across customers' marital status.

Income Category and Attrition

Card Category and Attrition

Total Relationship Count and Attrition

Months Inactive for 12 months and Attrition

Contacts Count 12 months and Attrition

This shows that attrition is higher when the number of contacts with the bank in the last 12 months is higher.

Bivariate Analysis of Attrition_Flag with continuous variables

Observations

Heatmap

Pair Plot

From the graph we can see a positive correlation between Customer_Age and Months_on_book, and another positive correlation between Avg_Open_To_Buy and Credit_Limit.
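The correlations behind a heatmap come from the Pearson correlation matrix; a minimal sketch with a toy frame mirroring the correlated pairs noted above (the values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the numeric churn features
df = pd.DataFrame({
    "Customer_Age":   [26, 35, 44, 53, 62],
    "Months_on_book": [13, 24, 33, 42, 52],
    "Credit_Limit":   [2000, 5000, 8000, 12000, 20000],
})

# Pearson correlation matrix; this is what sns.heatmap(df.corr()) would plot
print(df.corr().round(2))
```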

Insights Based On EDA

General Observations Based on EDA

Outlier Detection

This reveals that the total transaction count values of 138 and 139 do not really seem extreme compared to the rest of the transaction counts.

Similarly, the ages of 70 and 73 are not really extreme values compared to the other age values.

There are no values that deviate significantly enough to be called extreme.

At present, I do not see any significant outlier that is unacceptable for the dataset, so I will not be treating them at this point.
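The whisker check used above can be sketched with the usual 1.5×IQR rule; the age values here are illustrative, not the actual column:

```python
import numpy as np

# Illustrative ages, including the high-end values discussed above
ages = np.array([26, 33, 40, 46, 52, 58, 65, 70, 73])

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr          # boxplot upper whisker

print("upper whisker:", upper)
print("flagged:", ages[ages > upper])   # nothing beyond the whisker here
```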

Looking for unique values in the dataset

The Unknown values in the Education_Level, Marital_Status, and Income_Category columns can be treated as missing values.
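Treating the Unknown placeholder as missing amounts to replacing it with NaN; a minimal sketch on a toy frame (the values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "High School"],
    "Marital_Status":  ["Married", "Single", "Unknown"],
    "Income_Category": ["Unknown", "$60K - $80K", "Less than $40K"],
})

# Replace the 'Unknown' placeholder with a genuine missing value
cols = ["Education_Level", "Marital_Status", "Income_Category"]
df[cols] = df[cols].replace("Unknown", np.nan)

print(df.isna().sum())   # one missing value per column in this toy frame
```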

Missing Value Detection & Treatment

Missing Value Treatment

This reveals that the values are encoded.

Splitting the Dataset

Replacing the Missing Values using KNN Imputer

Checking inverse mapped values/categories

Now the inverse mapping has returned the original categories.
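The encode → impute → inverse-map cycle above can be sketched as follows; the mapping and toy frame are illustrative, and KNNImputer requires a fully numeric matrix, which is why the categories are encoded first:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
    "Customer_Age":    [45, 49, 51, 40],
})

# Encode categories as integer codes; NaN stays NaN for the imputer to fill
mapping = {"High School": 0, "Graduate": 1}
df["Education_Level"] = df["Education_Level"].map(mapping)

# KNN imputation fills the NaN from the nearest rows
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Round any fractional imputed code to a valid category, then inverse-map
inverse = {v: k for k, v in mapping.items()}
imputed["Education_Level"] = imputed["Education_Level"].round().astype(int).map(inverse)
print(imputed)
```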

Encoding categorical variables

After encoding, there are 47 columns.
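One-hot encoding with `pd.get_dummies` is what expands the column count; a minimal sketch on a toy frame (the columns and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender":        ["M", "F", "M"],
    "Card_Category": ["Blue", "Silver", "Gold"],
    "Customer_Age":  [45, 49, 51],
})

# One-hot encode the categoricals; drop_first avoids one redundant dummy per column
encoded = pd.get_dummies(df, columns=["Gender", "Card_Category"], drop_first=True)
print(encoded.columns.tolist())
```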

Model Building

Model evaluation criterion:

The model can make wrong predictions in two ways:

  1. Predicting a customer will renounce the credit card service when they do not - loss of resources
  2. Predicting a customer will not renounce the credit card service when they do leave - loss of income

Which case is more important?

How can we reduce this loss, i.e., reduce false negatives?
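Reducing false negatives means optimizing for recall, which directly penalizes churners the model misses. A small sketch with illustrative labels (1 = attrited, the positive class):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Illustrative labels: a false negative is a churner predicted as staying
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false negatives:", fn)                    # churners we would miss
print("recall:", recall_score(y_true, y_pred))   # tp / (tp + fn)
```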

Logistic Regression

Now we are going to improve model performance by up- and down-sampling the data, and regularize the model if need be.

Now we will evaluate model performance using KFold and cross_val_score.
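A minimal sketch of this evaluation, with a synthetic imbalanced dataset standing in for the churn features and recall as the scoring metric (since false negatives are the costly error here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic imbalanced data (~16% positive class, like the attrited customers)
X, y = make_classification(n_samples=400, weights=[0.84], random_state=1)

model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

# Cross-validated recall; this is what the boxplot summarizes
scores = cross_val_score(model, X, y, cv=kfold, scoring="recall")
print("recall per fold:", np.round(scores, 2))
print("mean recall:", scores.mean())
```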

The boxplot above shows that performance on the training set varies between 0.27 and 0.55. Next we check the performance on the test data.

The logistic regression model now shows a generalized performance on both the training and test sets.

Oversampling train data using SMOTE

Logistic Regression on Oversampled Data

Now we will see whether we can improve model performance using the oversampled data.
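The notebook uses SMOTE from imblearn to oversample. As a dependency-light sketch of the same idea, random oversampling with `sklearn.utils.resample` duplicates minority rows until the classes balance (SMOTE instead interpolates new synthetic minority samples); the array shapes and class ratio here are illustrative:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 16 + [0] * 84)   # ~16% minority, like the churn class

X_min, X_maj = X[y == 1], X[y == 0]

# Upsample minority rows (with replacement) to match the majority count
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)

X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))
print(np.bincount(y_bal))   # classes are now balanced
```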

Evaluate model performance using KFold and cross_val_score

Initially we evaluated the model, but the recall was very low, hence the need to improve its performance. Now we will evaluate the model using KFold and cross_val_score to see if it performs better.

Observation

Now we are going to check the performance on the test set

Observations

Since SMOTE did not really improve the model's performance, we can try other methods:

  1. Regularization, to check whether overfitting can be reduced
  2. Undersampling the training set, to handle the imbalance between the classes and check for improvement in model performance

Regularization
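In scikit-learn's LogisticRegression, the L2 penalty is controlled by `C` (smaller `C` = stronger regularization, shrinking the coefficients to curb overfitting). A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features
X, y = make_classification(n_samples=500, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# Default (weak) vs. strong L2 penalty
weak = LogisticRegression(C=1.0, penalty="l2", max_iter=1000).fit(X_tr, y_tr)
strong = LogisticRegression(C=0.01, penalty="l2", max_iter=1000).fit(X_tr, y_tr)

# The stronger penalty shrinks the coefficient vector
print("||coef|| at C=1.0 :", np.linalg.norm(weak.coef_))
print("||coef|| at C=0.01:", np.linalg.norm(strong.coef_))
```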

Now let us try undersampling the training data to handle the class imbalance.

Undersampling train data
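Undersampling is typically done with imblearn's RandomUnderSampler rather than SMOTE (which only oversamples). A dependency-free NumPy sketch of random undersampling, with illustrative array shapes and class ratio:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 16 + [0] * 84)   # ~16% minority class

min_idx = np.flatnonzero(y == 1)
maj_idx = np.flatnonzero(y == 0)

# Randomly drop majority rows until both classes have the minority count
keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
keep = np.concatenate([min_idx, keep_maj])

X_under, y_under = X[keep], y[keep]
print(np.bincount(y_under))   # balanced classes
```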

Logistic Regression on undersampled data

Evaluate model performance using KFold and cross_val_score

Now we check the performance on the test set with the confusion matrix.

Now logistic regression is able to differentiate better between the positive and negative classes.

Model Comparison - Logistic Regression

The logistic regression model trained on undersampled data gave a generalized performance and also a really good recall on the test data.

Finding the coefficients

Converting coefficients to odds
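Exponentiating a logistic regression coefficient gives the odds ratio: the multiplicative change in the odds of attrition per unit increase in that feature. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the encoded churn features
X, y = make_classification(n_samples=300, n_features=4, random_state=3)
model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = odds ratio for a one-unit increase in the feature
odds = np.exp(model.coef_[0])
for i, o in enumerate(odds):
    print(f"feature_{i}: coef={model.coef_[0][i]:+.2f}, odds ratio={o:.2f}")
```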

General Observations and Conclusion for Logistic Regression

Likewise, the odds of a customer attriting can be calculated based on the coefficient values of other attributes

Model Building

Bagging and Boosting

Combining all the models in one code cell to provide quick, comparable results.

From the graph we can deduce that XGBoost gives the highest cross-validated recall, followed by Gradient Boosting and then AdaBoost.
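The comparison loop above can be sketched as follows, using scikit-learn's ensembles on synthetic imbalanced data; xgboost's `XGBClassifier` slots into the same dictionary when that package is installed:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the churn data
X, y = make_classification(n_samples=400, weights=[0.84], random_state=4)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=4),
    "GradientBoosting": GradientBoostingClassifier(random_state=4),
    # "XGBoost": XGBClassifier(...)  # added the same way when xgboost is available
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="recall")
    print(f"{name}: mean cross-validated recall = {scores.mean():.2f}")
```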

Hyperparameter Tuning using Grid search & Random search for all models

Now, using pipelines for hyperparameter tuning, we will include StandardScaler while tuning the models with GridSearchCV and RandomizedSearchCV. Finally, we will compare the performance of the models to see which is better.

We use the make_pipeline function instead of Pipeline to create a pipeline.

We are going to create two functions, one to calculate different metrics and one for the confusion matrix, to avoid repeating the same code for each model.
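A minimal sketch of the pipeline-plus-grid-search pattern described above; `make_pipeline` names each step automatically from its class, so tuned parameters are addressed as `<stepname>__<param>` (the data and parameter grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=300, random_state=5)

# Steps are auto-named 'standardscaler' and 'logisticregression'
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Tune the estimator's C through the pipeline's step name
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1.0]},
                    cv=3, scoring="recall")
grid.fit(X, y)
print(grid.best_params_)
```

RandomizedSearchCV follows the same pattern, sampling from parameter distributions instead of exhaustively trying every combination.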

AdaBoost

GridSearchCV

The test recall is similar to the cross-validated recall, and there is slight overfitting.

RandomizedSearchCV

Random Forest

GridSearchCV

RandomizedSearchCV

XGBoost

GridSearchCV

RandomizedSearchCV

Decision Tree Classifier

GridSearchCV

RandomizedSearchCV

Bagging Classifier

GridSearchCV

RandomizedSearchCV

Comparing all models for Performance and Time Taken

Feature Importance-XGBoost

Total_Trans_Ct is the most important feature, followed by Total_Trans_Amt and Total_Revolving_Bal.
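The importance ranking above comes from the fitted model's `feature_importances_` attribute; a minimal sketch using scikit-learn's GradientBoostingClassifier on synthetic data (xgboost's `XGBClassifier` exposes the same attribute):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the churn features
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=6)
model = GradientBoostingClassifier(random_state=6).fit(X, y)

# feature_importances_ is normalized to sum to 1; sort descending for the plot
order = np.argsort(model.feature_importances_)[::-1]
for i in order:
    print(f"feature_{i}: {model.feature_importances_[i]:.3f}")
```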

Actionable Insights & Recommendations

Business Recommendations and Insights